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ABSTRACT 

Aims. A solution of the radiative-transfer problem in arbitrary velocity fields introduced in a previous paper, has limitations in its 
applicability. For large-scale applications, the methods described also require large memory sets that are commonly not available to 
state-of-the-art computing hardware. In this work, we modify the algorithm to allow the computation of large-scale problems. 
Methods. We reduce the memory footprint via a domain decomposition. By introducing iterative Gauss-Seidel type solvers, we 
improve the speed of the overall computation. Because of the domain decomposition, the new algorithm requires the use of parallel- 
computing systems. 

Results. The algorithm that we present permits large-scale solutions of radiative-transfer problems that include arbitrary wavelength 
couplings. In addition, we discover a quasi-analytic formal solution of the radiative transfer that significantly improves the over- 
all computation speed. More importantly, this method ensures that our algorithm can be applied to multi-dimensional Lagrangian 
radiative-transfer calculations. In multi-dimensional atmospheres, velocity fields are in general chaotic ensuring that the inclusion of 
arbitrary wavelength couplings are mandatory. 
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1. Introduction 

Radiative transfer in spherical symmetry and moving media 
has been solved with different methods. Operator-splitting tech- 
niques (Cannon 1973) remain the state-of-the-art methods. An 
operato r-splitting meth od that uses an approximate A-operator 
(ALO) (jScharmeiill981D - called accelerated A-iteration (ALI) 
- for the solution of the special relativistic Lagrangi an equation 
of rad iative transfer was described, for example, in iHauschildl 
dHH. ALa grangian description is necessary for the treatment 
of line transfer because of the complicated form of the emis- 
sivities and opacities in the Eulerian frame and also the large 
number of wavelength points needed to sample the spectral line 
at all points in the atmosphere. It follows that the velocity field 
mus t then appear explicitly in the equation of radiative trans- 
fer (lMihalaslll980h . For monotonic velocity fields, the solution 
of the radiative-transfer problem becomes an initial value prob- 
le m in wavelength th at can be solved with the method described 
in iHauschildtl (|1992|) . The inclusion of non-mono tonic velocity 
fields changes the problem into a boundary-value problem in 
wavelength. A so lution method and its alg orithm for this case 
was introduced inl Baron & Hauschildj ( l2004fl (Paper I). 

However, the applicability of the methods in Paper I depends 
on the size of the numerical system, which is determined mainly 
by the number of wavelength points as well as the number of 
layers in the 1-D model atmosphere. The number of wavelength 
points is the more critical factor, since it is commonly much 
higher than the number of layers in 1-D. The number of lay- 



ers is however important, because the number of entries in the 
A-operator described in Paper I is proportional to the number of 
layers squared. 

On average hardware, the algorithm from Paper I is only 
suited to solving problems with 64 layers and about 1000 
wavelength points because of the limited memory available. 
Configurations with more than 1000 wavelength points are 
called in the following, large-scale applications. 

The work from Paper I must be regarded as a proof of 
concept. In this work, we present a new algorithm that is ca- 
pable of efficiently solving large-scale applications. Besides 
being immediately useful for 1-D applications, the results 
are also applicable to the deve l opment of 3-D radiative 
transfer (Hauschildt & Baronl 120061 iBaron & Hauschildj l2007t 
Hauschildt & Baron 2008) . 

In Sect. |2] we summarise the formalism again for complete- 
ness. The approach for the new algorithm is described in Sect. [3] 
In Sect, m we present the results for test calculations and scala- 
bility tests before we conclude in Sect. |5] 

2. The formal framework in general 

The framework presented in lBaron & Hauschildt! (|2004|) was de- 
veloped for just one possible wavelength discretisation. During 
the course of this work, it became clear that a fully implicit 
discretisation (Hauschildt & Baron 2004) is necessary to solve 
the radiative-transfer problem in all generality. This is because 
for strong wavelength couplings the interpretation of the entire 



2 



Sebastian Knop et al.: Comoving-frame radiative transfer in arbitrary velocity fields - II. 



wavelength discretisation as a source term is not valid for all op- 
tical depths. Furthermore, the use of a new formal solution that 
avoids negative generalised opacities must be used ( Knop et al] 
120081) . Therefore, we discuss here the basic framework in those 
parts that are different from iBaron & Hauschildtl(l2004 . 

The equation of radiative transfer in its characteristic form 
for the specific intensity / along a path s reads: 



dl, d{AI,) 
as oA 



(1) 



where rj is the emissivity, x the opacity, and the subscript / in- 
dicates the dependence of the intensity on wavelength. The a/ 
term is the coupling term between the wavelengths that depends 
on both the st ructure of th e atmosphere and the mechanism of 
the coupHng (Mihalas 1980). 

The wav elength derivative can be dis cretised in two ways as 
described in lHauschildt & BaronI (l2004l) . The different discreti- 
sations can be mixed via a Crank-Nicholson scheme with a mix- 
ing parameter ^ e [0, 1]. The wavelength-discretised equation of 
radiative transfer can then be written as: 



d^ 

ds 



T]i-xili-ai{4 + ^ p^l)!, 
-[l-^]a, [pjh-i + p\li + p^ih+x) 



(2) 



where the p' coefficients in an ordered wavelength grid < 
Al < Ai+i are defined as: 
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The dependence on th e sign of q a is introduced to define local 
upwind schemes (see .Baron & Hau schildt (2004^)). 

Af ter introducing a generalised opacity (see iKnop et aP 
(I2008h ^ 



(5) 

(6) 
(7) 



Xi^Xi + ^ aip), 

as well as defining the source functions: 

c m 

= — 

XI 

Si = i^lsi-^^(pJI,-i+ptlM)] 
Xi\ Xi I 

5/ = - ^ ([1 - ^] [p-ih-i + pIIm) + [4 + [1 - ^] p\] I,) (8) 
Xi 

a formal solution of the radiative-transfer problem can be formu- 
lated. In our work, we use a characteristic method along different 
photon paths throughout the atmosphere. The spatial position of 
a characteristic is then discretised on a spatial grid. In the follow- 
ing, a pair of subscript indices indicate the position in both the 
spatial grid and the wavelength grid. Commonly the spatial grid 
is mapped onto an optical depth grid via the relation dr; = xids. 
The formal solution of the equation of radiative transfer in Eq.|2] 



between two points s,_i and s; on a spatial grid along the photon 
path can be written in terms of the optical depth as follows: 



/;,/ = Ii-i.ie-^" + SIi,i + 6Iij 
Siij 



(9) 
(10) 



SI, 



I Sie'' '"'dr = aijSi-ij +/3ijSij + jijS i+ij 

f 'Sie^-''dT = aijS 1-1,1 +Mi.i (11) 

with At - T,+i,/ -T,./andT,./ - r'yi(s)d s. Th e a-P-y coefficient s 
are described in lOIson & Kunaszl (119871) and lHauschild^ ( Il992h . 
In Eq.[TT]57/ is interpolated linearly and the coefficients differ in 
general from the coefficients in Eq. [TOl and is therefore marked 
with a tilde. 

Equation |9] can be written in matrix notation for any given 
characteristic: 



I = A I + Al 



(12) 



Here I is a vector containing all intensities, A is a square matrix 
that describes the influence of the different intensities upon each 
other, and Al is a vector with the thermal emission and scattering 
contribution of the source function. For a characteristic with n, 
spatial points and n; points in the wavelength grid, the intensity 
vector I has n, x «/ entries. In the following, a superscript of k 
labels the characteristic being described. The components of the 
matrix A from Eq. [12] at the spatial point / and the wavelength 
point / are given by: 



^ij = - i^O'i,/ + [1 - ^] a,-,/) Tj^Pl-l.l 

Xi-\j 

B;f = -{^f^j + n-^]^i.^^p;f 

Xi.l 

^ij = -^yuTj^PMj 

Xi+\i 
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(14) 
(15) 



A]:t =exp(-ATf_,,)-4^[4 + [l-f]pi.L\J (16) 

Xi-ij 

- -^ai[4+[i-^]p:.::i (17) 
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The naming scheme of the quantities defined in Eqs.[T3]tol22l 
indicates the specific intensity element with which they are as- 
sociated. For an index pair / and /, a superscript refers to an 
intensity at wavelength / - 1, a superscript to the same wave- 
length, and •'^ to the next wavelength point / + 1. The A, B, and 
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Fig. 1. Explicit matrix form of the formal solution for a characteristic of length k and / wavelength points. The horizontal Unes mark 
block borders of different wavelengths to clarify the structure. 

The matrix has three tridiagonal bands. The one on the main diagonal is called diag{- \) and the lower and upper accordingly 
sub(- -) and superi- +). The diagonals of these bands are called A, B, and C. 



C terms refer to the spatial points / - 1 , /, and / + 1 respectively. 
For clarity, the structure of the matrix of the formal solution is 
shown schematically in Fig.[T] 

An element of the source function vector AI is given by: 

'^I^j-'AKu+I^A + ^Au (23) 

From Eq. [12] the solution for the specific intensity at a given 
spatial point and wavelength reads: 

Given the form of Eq. |24] for the formal solution, the con- 
struction of the A*-operator can proceed exactly as described 
in Baro n & Hauschildt. (,2004,) and will not be discussed further 
here. 

3. Optimisation of the algoritlim 

In the following sections, we outline the changes made to the 
algorithm from Paper I to improve the performance and usability 
of the solution to the radiative-transfer problem in the case of 
large-scale appUcations. 

3. 1 . Smaller amount of memory requirements 

The solution of the equation of transfer for arbitrary wave- 
length couplings is a boundary-value problem in wavelength 



(iBaron & Hauschildl |2004|) . where the wavelength derivative 
sense changes throughout the atmosphere. This implies that 
the radiative transfer must be solved for all discrete wave- 
length points at the same time, and that all wavelength depen- 
dent quantities such as the opacities, interpolation coefficients, 
wavelength-derivative discretisation, and the A-operator must 
be kept in memory at the same time. For large-scale applica- 
tions, these requirements easily exceed the memory of com- 
monly available computer hardware. Therefore, the key method 
of solution is a domain decomposition of the data. Ideally, every 
process stores only the data that it works on. This immediately 
implies parallelisation of code execution as well. 

The formal solution can be parallelised. The formal solutions 
for different characteristics are independent of each other and 
accordingly a computing node in a parallel setup must store only 
the data for those characteristics upon which it works. 

The A*-operator is the largest data object that must be re- 
tained in memory, and also offers the most hopeful possibility 
for optimisation. The operator has the full spatial bandwidth but 
is only tridiagonal in wavelength. Therefore, the number of en- 
tries of the A*-operator is niayer x "layer X «^ X 3, where niayer is 
the number of discrete radial points in a spherically symmetric 
atmosphere and n,i is the number of discrete wavelength points. 

For a model atmosphere with 100 layers and 20 000 wave- 
length points, the operator takes up w 4.5 GB of memory. This 
easily exceeds the average memory per processor available in 
modern computing systems. The complete operator must be kept 
in memory for the solution of the ALI step if direct solvers such 
as the LAPACK package are used. The need to store the factori- 
sation actually doubles the memory requirements. However, this 
is not the case if an iterative method of solution is used for the 
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ALT step. Then different tasks can work on different parts of the 
index field that are iterated. This in turn means that only those 
parts of the operator that are needed for the local iteration of the 
new mean intensities need to exist in each task. The storage re- 
quirements are then reduced by a factor equal to the number of 
tasks involved in the computation. 

A drawback of this strategy of decomposing the opera- 
tor is that it greatly increases the need for communication be- 
tween tasks. The formal solution and its accompanying data are 
parallelised over the characteristics of the radiation field (see 
Sect. 13. 3t . A A*-operator element is influenced by all charac- 
teristics and therefore contributions to the element that is stored 
for just one process must be calculated by all processes and be 
communicated. A parallelised iterative solution of the mean in- 
tensity in the ALO step is also enforced, increasing the need for 
communication even more (see Sect. l3.4l ). 




Fig. 2. Flow of information along a characteristic and wave- 
length for alternating wavelength couplings. 



3.2. Optimisation of tine speed of tine formai soiution 

To improve the overall computational speed, the time take to 
complete a formal solution must be decreased because it is per- 
formed most often in an ALI s olution. In small-sc ale appli- 
cations, the SuperLU package (IDemmel et al.l [l999l) provided 
an efficient solver for the matrix equation of the formal solu- 
tion along a characteristic. However, there was room for im- 
provement, e.g., in terms of memory footpri nt, so we developed 
an ite rative Gauss-Seidel (GS) type solver dOolub & Van LoanI 
Il989l) . which proved to have a minimal memory footprint as 
well as to be very fast. The main advantage of this new solver 
is, however, that for a linearly interpolated source function S, 
the formal solution, becomes quasi-analytic and there is no need 
at all to solve a matrix equation. 

In principle, our method is a standard GS type iterative solver 
that uses a physically guided index field stepping scheme. That 
means that we use knowledge of the physics along the charac- 
teristic to increase the convergence of the iteration by a huge 
factor. The GS method does not prescribe the order in which 
the elements of the system are iterated. Therefore, we can freely 
choose the order of the steps in the solution of the linear system. 
We choose it so that we follow the motion of a pulse along the 
characteristic, hence follow the physical flow of information in 
the system. Because we know that the information will always 
be propagated along the characteristic, we are left only with the 
task of determining whether the information flows to longer or 
shorter wavelengths at any given spatial point. This problem has 
been already solved in the construction of the approximate A- 
operator and its solution can be reused here. 

To clarify the process of stepping, the flow of information 
along a characteristic for alternating wavelength couplings is 
shown schematically in Fig. |2] The arrows mark the direction 
- in space and in wavelength - in which the information flows 
along a characteristic at wavelength A. It is obvious that the infor- 
mation flows only along the characteristic and along the wave- 
length derivative sense determined by the sign of a/. 

In the following, the intensities of the nth iteration are writ- 
ten with an additional superscript (n). The iteration step for a 
specific intensity at a point / and wavelength / of a characteristic 
k can then be written for aj^, > as 

1,1 \ ij / \ Ij ij ij-l 

+ A7;*/*'(";V+^Y'f''''w'' + c:;*/*'<"! ,) (25) 

(,/ l-lj-l ij 1-1,1 ij l+lj-l / ^ ' 



and for taj^^ < as 

^^t^^Kf^^C\f^S.^ (26) 

where all coefficients that vanish have been omitted for the given 
sign of fl/. 

From Eqs. I25l426l it can be seen that in the case of linear in- 
terpolation of the S I source function, the scheme becomes inde- 
pendent of elements with the iteration order (n) and is therefore 
quasi-analytic since it depends only on elements with the same 
iteration order. 

In explicit form, the formal solution is given for flf , > by: 
and for a^.^ < by 

Vi,/ y^^> 

The formal solution remains a boundary condition problem, 
but it can be solved as an initial value problem at every spatial 
point. The solution is then direct and its speed is optimal. 

3.3. Caicuiation of tine formai soiution matrix 

All non-zero entries in the Matrix A - also called characteris- 
tic data in the following - from Eq. [12] must be known before 
the formal solution can be calculated. In the previous algorithm, 
the calculation of this characteristic data was parallelised over 
wavelength to ensur e optimal scaling with the number of pro- 
cessors (see Fig. 3 in lBaron & Hauschildtl (l2004t) ). The problem 
with this approach is that every process also calculates data it 
does not need. Furthermore, the data was written to disk to al- 
low other processes to access the data in case it was needed. 
This was developed to allow for small memory demands since 
the data for a characteristic could be loaded just before the cal- 
culation and deleted afterwards. This strategy proved to be trou- 
blesome for large numbers of wavelength points because the I/O 
performance was the limiting factor in the calculations. In non- 
parallel file systems, the simultaneous writing and reading of the 
data files also proved to be a severe bottle neck. The severity of 
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this problem can be reduced if a server process does all the I/O 
and distributes/receives the data to/from the client processes. 

Optimal performance was achieved if the setup for the calcu- 
lation of characteristic data was parallelised over characteristics 
instead. Every process then calculates the data it will need and 
stores it directly in memory. This completely removes the need 
for I/O and increases the speed by a large factor. Concerns about 
load balancing proved to be unfounded. The calculation of the 
necessary data is so fast that the inbalance in the load is not a 
factor This setup has the disadvantage that for large-scale appli- 
cations the calculations cannot be performed on a small number 
of processors. Because the data files on disk are absent, the data 
must be kept in memory at all times and thus a larger number of 
processors is needed to perform the domain decomposition ef- 
fectively. In practice, this is not a problem since the number of 
processors is normally chosen to be large to reduce the overall 
computing time anyway. 

3.4. Iterative solution of the ALO step 

As described in Sect. 13.1! the domain decomposition of the ALO 
reduces the memory requirements. However, one then must use 
an iterative solution for the ALO system. The convergence of 
iterative solvers is also likely to be very good because the source 
function in the later iterations will already be close to the final 
solution, whereas direct solutions cannot take advantage of this. 
Further iterative solvers are well suited to keeping the memory 
footprint small because they do not have to keep additional data 
in memory besides the linear system. 

We implemented GS and Jacobi type solvers to solve the lin- 
ear system of the ALI step: 

(1 - e A*) 7 = - e AV"'"* (29) 

where J is the mean intensity and e - ^, where <t is the scat- 
tering part of the opacity x- The mean intensity from the for- 
mal solution is J^^ and J°^^ is the result from the previous ALO 
step. The A*-operator is tridiagonal in wavelength, but has the 
full spatial bandwidth and its elements are identical to the cor- 
responding A-operator elements. The three different bands in 
wavelength are called A*,A;^, and A*. An example of A* in 
Eq.|29]at the wavelength I can then be written as: 

A* [J,] = AUm + A*^ Ji + A:7,_i (30) 

Equation|29]can then be rearranged into the following form 

Jm,l = (l - fm,/^\,m,m,/) ' (-^S " Yj ^"''■^nf^\,m,n,l 

n 

Zc: /old A * _ V ^ /old A * 

n n 

ni^m n 

+ ^ A! „,„ ,+1 j, (31) 

n 

where n and m are indices for the number of layers. Equation!?!] 
can be readily used in the GS and Jacobi iteration schemes. 

Because the mean intensity at a wavelength / depends on 
quantities at the wavelengths 1-1,1, and l+l, the domain decom- 
position of the ALO in wavelength must be performed blockwise 
for the GS method to be applicable. These blocks must overlap 
with one wavelength point at the boundaries to minimise com- 
munication. 




Fig. 3. Schematic overview of the parallelisation strategy 
for four processes. Backward-shaded areas (\) indicate 
characteristic-dependent data and forward-shaded areas (/) 
wavelength-dependent data. 

3.5. Summary of the optimal parallelisation 

The parallelisation strategy is the key element for the compu- 
tation of large-scale applications in the framework of radiative 
transfer with arbitrary wavelength couplings. As a summary, the 
most important aspects are schematically shown in Fig. [3] 

4. Test calculations 

We present the results of test calculations that we performed to 
test the new algorithm in terms of speed, scalability, and by re- 
gression testing. 

In the following, model calculations have about 20 000 
wavelength points and 64 layers unless noted otherwise. This 
setup would have been impossible with the old algorithm for ra- 
diative transfer as the memory of an average computing node 
would have been by far exceeded. Unless noted otherwise, the 
computations were performed on 1 800 MHz AMD Opteron 244 
CPUs with 4 GB RAM per CPU. 

4.1. Comparison to old algorithm 

We compare the new radiative-transfe r algor ithm with the meth- 
ods described in lBaron & Hauschildi (|2004|) . We show that the 
improvements to the formal solution and the ALO step solver 
described in Sect. [3] are significant. 

4.1 .1 . Formal solution 

We compare the speed of two solvers of the formal solution: 
the SuperLU solver package and our quasi-analytic solution (see 
Sect. l3.2l l. The comparison is unfair, because the LU decomposi- 
tion cannot take advantage of the special character of the matrix. 
However, the comparison demonstrates clearly that the quasi- 
analytic solution is needed to calculate large-scale applications 
in the given framework, since even fast and sophisticated solvers 



Sebastian Knop et al.: Comoving-frame radiative transfer in arbitrary velocity fields - II. 
Formal solution speeds ALO step timing 



100 
80 

S 60 

(D 

B 

40 
20 



Optimal 
SuperLUI 




Fig. 4. Comparison of the mean times needed for the complete 
formal solution using the optimal algorithm and SuperLU for 
different numbers of processors. 
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Fig. 6. Comparison of the iteration times of the approximate A- 
iteration for different algorithms. 
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Fig. 5. Comparison of the mean time (left) as well as the memory 
footprint (right) needed for the formal solution of the longest 
characteristic using the optimal algorithm and SuperLU. 



such as SuperLU are not fast enough to ensure that the calcula- 
tion is practically feasible. 

In Fig. m the mean time needed for a formal solution is 
shown for the two solvers and different numbers of processors. 
It is obvious that the optimal solution outperforms the SuperLU 
package. The SuperLU solution benefits well from an increase 
in the number of processors. This is also true for the optimal 
solution, but the effect is not as dramatic because the times are 
already very short (see Sect. |4.2."3l for the scalability of the for- 
mal solution). 

In Fig.|5] the time and memory consumption comparison for 
the different solvers is shown for the formal solution along the 
longest characteristic in the atmosphere. This allows a more di- 
rect comparison of the solutions. The speed advantage of the 
optimal solution is again obvious. The memory footprint of the 
optimal solution is also less than half as large as for SuperLU. 

The possibility of an analytic solution and the resulting fast 
formal solution open up the possibility of calculating radia- 
tive transfer in multiple, spatial dimensions with a characteristic 
method in the Lagrangian frame, while allowing non-monotonic 
velocity fields or other arbitrary wavelength couplings. If the 
speed of the formal solution were not optimal, the vast number 



of characteristics would increase the computation time beyond 
feasibility. 

4.1.2. Iterative ALO step 

In Fig. |6] we compare the iteration speeds of the ALI for dif- 
ferent algorithm^]. The LAPACK solver has a special role here, 
since it was the method of choice in the previous algorithm. As a 
direct solver the speed is always the same because it cannot take 
advantage of the benefits provided by a source function that is 
already close to the solution. 

The opposite is true for the iterative solvers. Here we show 
the results for Jacobi and GS solvers in serial mode as well as 
in parallelised versions. The serial version of the Jacobi solver 
is significantly slower than the direct solution. It becomes com- 
parable in efficiency with the LAPACK solver in the last few 
iterations because it makes use of the convergence of the source 
function. 

The serial GS and the parallelised Jacobi solver have simi- 
lar speeds in the given example. For the first iterations, they are 
slower but after about one third of the iterations they become 
faster than the LAPACK solver. This results in an better overall 
superior performance for the complete ALI. 

The parallelised GS solver is even faster than the LAPACK 
solver from the beginning and provides the best performance of 
all solvers, while keeping the memory requirements to a mini- 
mum. 



4.2. Scalability of the new algorithm 

By considering scalability, we describe the increased speed and 
reduced memory footprint of the algorithm when the strategies 
from Sect. [3] are applied to different numbers of processors. 

In the following, we present the results for the domain de- 
composition of the ALO, the ALO construction times, the for- 
mal solution speeds, and the ALO iteration speeds. 



' Tiie number of wavelengtii points had to be reduced to about 10 000 
in this example due to the increased memory footprint of the LAPACK 
solver. Otherwise the calculation would not have been possible on the 
available hardware. Furthermore, this test was performed on a different 
computer (HLRN-I) than the other calculations. Hence, direct speed 
comparisons with other results in this work are invalid. 
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Fig. 7. Comparison of the maximum memory used between 
decomposed and non-decomposed algorithms of the radiative 
transfer for 10, 20, and 30 processors. 



4.2.1 . Domain decomposition of tine ALO 

Here we show the drastic effect that the domain decomposition 
of the ALO has on the overall memory consumption. 

In Fig.|7] the maximal allocated memory is shown for algo- 
rithms with and without a decomposed ALO for 10, 20, and 30 
processors. The maximum memory allocated not only includes 
the ALO but also all other data, such as the formal solution data 
and the opacities. 

For the non-decomposed as well as decomposed algorithms, 
the memory footprint reduces when the number of processors 
is increased. In the non-decomposed case, this is only caused 
by the reduction in the characteristics data that must be kept in 
memory. In the decomposed case, the smaller amount of ALO 
data that must be stored further decreases the memory usage. It 
is obvious that the decomposed algorithm can reduce the mem- 
ory requirements sufficiently for it to be used on average current 
hardware. 



4.2.2. Construction of the ALO 

One of the drawbacks of a decomposed ALO is its distributed 
construction (see Sect. 13. 11 1. Figure |8] shows how the construc- 
tion time of the decomposed ALO compares with the construc- 
tion time of the non-decomposed ALO for different numbers of 
processors. 

Because the decomposed ALO parts are not calculated by 
one task alone but from all formal solution tasks, the construc- 
tion time increases. However, the more formal solution tasks 
present, the faster the construction becomes. This can also be 
seen in Fig. [8] For a highly parallelised calculation with 30 pro- 
cessors, the decomposed construction is only « 1 .4 times slower 
than in the non-decomposed case. Since the construction of the 
ALO must be performed only once during a full ALI, this draw- 
back is not significant in the light of the smaller memory foot- 
print that the decomposition provides. 

4.2.3. Formal solution 

The formal solution is the routine that is called the most often 
during a full ALI solution. Hence, it is especially important that 
it be as fast as possible. In Fig.|9] the mean times for a full opti- 



120 
100 

^ 80 

m 

S 60 

40 

30 




undecomposed 
decomposed | 




Fig. 8. Comparison of the construction times of the approximate 
A-operator between non-decomposed and decomposed setups. 
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Fig. 9. Overview of the timings for the formal solution for dif- 
ferent numbers of processors. 



mal formal solution are shown for different numbers of proces- 
sors. 

The formal solution speed scales with the number of proces- 
sors. However, the scaling is not linear The total time needed for 
a complete formal solution is limited by the time of the process 
with the largest characteristics set involved in the computation. 

The overall speed of the local formal solutions is determined 
by the number of characteristics and the total number of spatial 
points along characteristics that a task must handle. It follows 
that an increase in the number of processors produces a signif- 
icant increase in speed only if the number of characteristics on 
the slowest task is reduced. 

The formal solution achieves its optimal speed when there 
are at least as many tasks as there are characteristics in the sys- 
tem. 



4.2.4. ALO iteration speeds 

After the formal solution, the ALO step is the next most time- 
consuming part of the calculation. In Fig. [10] the iteration times 
for a full ALI are shown for 6, 12, 18, and 36 processors. In these 
calculations, the parallelised GS-type solver has been used. 

As the source function comes closer to its final value, the 
iteration speed increases for all calculations. Furthermore, it 
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Fig. 10. Comparison of the iteration times of the decomposed 
parallelised GS type approximate A-iteration for a different 
number of processors. 



is clear from Fig. [TO] that the overall speed increases with the 
number of processors, although the scaling is clearly non-linear 
Nonetheless, the overall computing time of the ALI increases as 
the number of processors working on it increases. The asymp- 
totic scaling results from the need for communication between 
all ALI tasks after each GS step to update the iterated solution. 

4.3. Regression test for supernova atmosphere 

The formalism and the old algorithm were compared with an- 
other solution for the case of monotonic velocity fields in 
Paper I, although the tests were limited to a toy model. With the 
new algorithm, we are now in the position to calculate full atmo- 
sphere models. Hence, it is instructive to repeat the regression 
tests in the new framework. As an example of a large-scale ap- 
plication, we compare the spectra from a supernova atmosphere 
calculation with 100 layers and about 20 000 wavelength points. 
As a reference for the solution of the radiative transfer in a mono- 
tonic velocity-fiel d we used the well tested and established algo- 
rithm described in lHauschiIdtl(ll992h . 

The resulting comoving spectra from the old and new algo- 
rithm diff'er for only about 20 wavelength points in the first ten 
leading digits and the maximal relative differences are of the or- 
der of 10 ^ 

This magnitude of difference is at first unexpected, since 
both radiative-transfer algorithms use an internal relative con- 
vergence criterion of 10"**. To understand the remaining differ- 
ences, we recall that the iteration procedure differs substantially 
between the two algorithms. In the recursive solution, the trans- 
fer is solved wavelength by wavelength and the ALI step will 
stop as soon as the internal criterion is reached. This is not true 
for the matrix-based solution, in which all wavelength points are 
iterated at the same time. This means that the ALI will continue 
to iterate the solution also at those wavelength points, which are 
already internally converged. 

More importantly, the solution from the previous wavelength 
point is fixed in the recursive scheme and accordingly the influ- 
ence of the wavelength coupling does not change in the solution 
of the radiative transfer at any given wavelength point. The op- 
posite is again true for the matrix-based solution, since the solu- 
tion is obtained for all wavelength points simultaneously. This 
means that the solution converges more successfully at most 
wavelength points in the matrix-based solution, since the solu- 



tion is consistent with the convergence criterion for all wave- 
lengths at the same time. 

Taking the points above into account, the overall match of 
the two solutions is excellent and no residuals were found in our 
test calculations. 



5. Conclusion 

We have presented algorithm strategies and details of the solu- 
tion of the radiative-transfer problem in atmospheres with arbi- 
trary wavelength couplings that are suited to the treatment of 
large-scale applications. The main aim of the optimisations of 
the existing framework has been the reduction of the memory 
usage to make the calculations feasible on currently available 
hardware. This has been achieved by domain decomposition of 
the data and parallelised code execution. In addition, the speed 
of the formal solution, the calculation of its matrix elements, 
and the ALI have all been vastly improved. The speed of all new 
algorithms scales with the number of processors used in the cal- 
culations. Although the scaling is non-linear, the overall com- 
putation time is still significantly reduced by an increase in the 
number of processors. 

We are now in a position to calculate large-scale model atmo- 
spheres that include alternating wavelength couplings - as from 
non-monotonic velocity fields or general-relativistic wavelength 
shifts. 

Future possible applications are the velocity profiles of cool 
stellar winds, the treatment of partial redistribution, and the 
calculation of radiative transfer in shock fronts as in accre- 
tion shocks. However, most promising are the prospects for the 
transition of radiative transfer to multiple spatial dimensions. 
Because of the good scaling of the formal solution and the ALO 
step with the number of processors, the algorithm strategy can 
be reused for 3D calculations. 
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